Question Selection for Crowd Entity Resolution

نویسندگان

  • Steven Euijong Whang
  • Peter Lofgren
  • Hector Garcia-Molina
چکیده

We study the problem of enhancing Entity Resolution (ER) with the help of crowdsourcing. ER is the problem of clustering records that refer to the same real-world entity and can be an extremely di cult process for computer algorithms alone. For example, figuring out which images refer to the same person can be a hard task for computers, but an easy one for humans. We study the problem of resolving records with crowdsourcing where we ask questions to humans in order to guide ER into producing accurate results. Since human work is costly, our goal is to ask as few questions as possible. We propose a probabilistic framework for ER that can be used to estimate how much ER accuracy we obtain by asking each question and select the best question with the highest expected accuracy. Computing the expected accuracy is #P-hard, so we propose approximation techniques for e cient computation. We evaluate our best question algorithms on real and synthetic datasets and demonstrate how we can obtain high ER accuracy while significantly reducing the number of questions asked to humans.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

متن کامل

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

متن کامل

Data collection and annotation for state-of-the-art NER using unmanaged crowds

This paper presents strategies for generating entity level annotated text utterances using unmanaged crowds. These utterances are then used to build state-of-the-art Named Entity Recognition (NER) models, a required component to build dialogue systems. First, a wide variety of raw utterances are collected through a variant elicitation task. We ensure that these utterances are relevant by feedin...

متن کامل

A Theoretical Analysis of First Heuristics of Crowdsourced Entity Resolution

Entity resolution (ER) is the task of identifying all records in a database that refer to the same underlying entity, and are therefore duplicates of each other. Due to inherent ambiguity of data representation and poor data quality, ER is a challenging task for any automated process. As a remedy, human-powered ER via crowdsourcing has become popular in recent years. Using crowd to answer queri...

متن کامل

A Model for Processing Skyline Queries in Crowd-sourced Databases

Received Jan 15, 2018 Revised Mar 29, 2018 Accepted Apr 11, 2018 Nowadays, in most of the modern database applications, lots of critical queries and tasks cannot be completely addressed by machine. Crowdsourcing database has become a new paradigm for harness human cognitive abilities to process these computer hard tasks. In particular, those problems that are difficult for machines but easier f...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • PVLDB

دوره 6  شماره 

صفحات  -

تاریخ انتشار 2013